Exploiting a Proximity-based Positional Model to Improve the Quality of Information Extraction by Text Segmentation
Authors
Abstract
Many web pages present information about entities as lists of field values. Such implicit semi-structured records are common in textual sources on the web, such as product advertisements, postal addresses, and bibliographic information. Harvesting entity information from these lists is a challenging task because the lists are manually generated, do not follow a well-defined template, and may omit some information. In this paper, we introduce a proximity-based positional model (PPM) to improve the quality of information extraction by text segmentation (IETS). Our proposed model improves on the fixed-positional model proposed in ONDUX, a current state-of-the-art IETS method, for revising the labels of text segments in an input list of field values. Unlike the fixed-positional model of previous work, the key idea of PPM is to define a proximity heuristic for labels in an input list within a unified language model. The model is estimated from counts of labels propagated through a proximity-based density function. We propose and study several density functions, and experimental results on different domains show that PPM revises labels effectively and helps improve the performance of the current state-of-the-art method.
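The abstract does not give formulas, so the following is only a rough sketch of what "propagated counts through a proximity-based density function" could look like. It assumes a Gaussian kernel as the density function, a bandwidth parameter `sigma`, and a simple per-position normalization. The function names, the label names, and the training sequences are all hypothetical and are not taken from the paper.

```python
import math
from collections import defaultdict


def gaussian_kernel(i, j, sigma=1.0):
    """Assumed density function: weight an occurrence at position j
    contributes to position i, decaying with the distance |i - j|."""
    return math.exp(-((i - j) ** 2) / (2.0 * sigma ** 2))


def propagated_counts(label_sequences, num_positions, kernel=gaussian_kernel):
    """For each label, accumulate kernel-weighted counts at every position.

    A fixed-positional model would add 1 only at the exact position j;
    here each occurrence also contributes to nearby positions.
    """
    counts = defaultdict(lambda: [0.0] * num_positions)
    for seq in label_sequences:
        for j, label in enumerate(seq):
            for i in range(num_positions):
                counts[label][i] += kernel(i, j)
    return counts


def label_distribution(counts, position):
    """Normalize propagated counts at one position into probabilities."""
    total = sum(c[position] for c in counts.values())
    if total == 0:
        return {}
    return {label: c[position] / total for label, c in counts.items()}


# Hypothetical label sequences, e.g. from bibliographic records.
sequences = [
    ["Title", "Author", "Year"],
    ["Title", "Author", "Year"],
    ["Author", "Title", "Year"],
]
counts = propagated_counts(sequences, num_positions=3)
first = label_distribution(counts, 0)
best_first = max(first, key=first.get)
```

In this sketch a label seen at a neighboring position still receives some probability mass, which is one way to tolerate records whose fields are reordered or missing. A smaller `sigma` makes the estimate behave more like exact position counting.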
Similar resources
Document Analysis And Classification Based On Passing Window
In this paper, we present a document analysis and classification system to segment and classify the contents of Arabic document images. The system includes preprocessing, document segmentation, feature extraction, and document classification. In preprocessing, a document image is enhanced by removing noise, binarizing, and detecting and correcting image skew. In document segmentation, an algorith...
A Modified Character Segmentation Algorithm for Farsi Printed Text Using Upper Contour Labelling
In this paper, a modified segmentation algorithm for printed Farsi words is presented. This algorithm is based on a previous work by Azmi that uses the conditional labeling of the upper contour to find the segmentation points. The main objective is to improve the segmentation results for low quality prints. To achieve this, various modifications on local baseline detection, contour labeling an...
Cluster-Based Image Segmentation Using Fuzzy Markov Random Field
Image segmentation is an important task in image processing and computer vision that attracts the attention of many researchers. The pixels in an image carry two sets of information: statistical and structural, which refer to the feature values of the pixel data and the local correlation of the pixel data, respectively. Markov random field (MRF) is a tool for modeling statistical and structural inf...
A New Method for Improving Computational Cost of Open Information Extraction Systems Using Log-Linear Model
Information extraction (IE) is the process of automatically producing a structured representation from unstructured or semi-structured text. It is a long-standing challenge in natural language processing (NLP) that has been intensified by the increasing volume, heterogeneity, and unstructured form of information. One of the core information extraction tasks is relation extraction wh...